Improved Algorithms For Keyword Extraction and Headline Generation From Unstructured Text

Authors

  • Amit Kumar Mondal
  • Harish Karnick
Abstract

The problem of generating headlines for documents using a purely statistical approach is long-standing. We describe an improved extractive approach based on keywords. The insight is that anyone summarizing a document will invariably use keywords from the document itself. The problem has two aspects: finding the relevant set of keywords, and finding the proper way to combine these words into a coherent and grammatical headline. Keyword selection can be tackled using various statistics about keywords in the document, whereas generating the headline sentence from the selected keywords cannot be solved well without resorting to knowledge about the semantics of the keywords. To improve how accurately the headline reflects the content, the gist or topic of the document is first identified. The results show that our extractive approach is feasible for generating informative headlines.
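
The abstract does not say which statistics the authors use for keyword selection. As a rough illustration of the general idea only, the following minimal sketch ranks words by plain term frequency after dropping stopwords, breaking ties by first appearance. The function name `top_keywords`, the tiny `STOPWORDS` set, and the scoring rule are all assumptions made for this example and are not taken from the paper.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "is", "are",
    "for", "with", "by", "from", "that", "this", "it", "as", "at", "be",
}


def top_keywords(text, k=5):
    """Return up to k candidate keywords ranked by raw term frequency.

    Assumed scoring: count of each non-stopword token longer than two
    letters; ties are broken by earliest position in the text.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    content = [t for t in tokens if t not in STOPWORDS and len(t) > 2]

    counts = Counter(content)

    # Remember where each word first occurs so ties are deterministic.
    first_pos = {}
    for i, token in enumerate(content):
        first_pos.setdefault(token, i)

    ranked = sorted(counts, key=lambda t: (-counts[t], first_pos[t]))
    return ranked[:k]
```

A sketch like this covers only the first of the two aspects named in the abstract. Ordering the selected words into a grammatical headline is the part the abstract says needs semantic knowledge, and it is not attempted here.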

Similar resources

Keyword Extraction and Headline Generation Using Novel Word Features

We introduce several novel word features for keyword extraction and headline generation. These new word features are derived according to the background knowledge of a document as supplied by Wikipedia. Given a document, to acquire its background knowledge from Wikipedia, we first generate a query for searching the Wikipedia corpus based on the key facts present in the document. We then use the...

Automatic Thai Keyword Extraction from Categorized Text Corpus

Information Extraction (IE) is a process of discovering implicit and potentially important keywords underlying an unstructured natural-language text corpus. Most previously proposed solutions to IE were accomplished by constructing a set of words from a given text corpus during the preprocessing step. Due to the inherent characteristic of Thai written language which does not explicitly use any word d...

Extracting information from the text of electronic medical records to improve case detection: a systematic review

BACKGROUND Electronic medical records (EMRs) are revolutionizing health-related research. One key issue for study quality is the accurate identification of patients with the condition of interest. Information in EMRs can be entered as structured codes or unstructured free text. The majority of research studies have used only coded parts of EMRs for case-detection, which may bias findings, miss ...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to a wide range of library resources. However, classifying documents from a large amount of data is still an issue, and finding particular documents demands time and energy. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...

Publication date: 2004